Towards Distribution of Web Sites in a Crawler Used for Large Scale Web Accessibility Assessment
Abstract
A mechanism used for large-scale accessibility measurement may involve a distributed web crawler. It then makes sense to spread the web sites involved across the different access points (crawler locations / crawler nodes) of the distributed crawler. In this publication we present an algorithm that utilises the available resources to a much greater extent than the traditional uniform distribution of web sites. Our novel algorithm, the Time Weighted Object Migration Automaton (TWOMA), is an extension of the Object Migration Automaton (OMA) presented in [1]. The heart of our scheme involves continuously accessing web sites while measuring the duration of each access. Note that accessing a site involves both downloading it and measuring its accessibility. When a web site is accessed, the following happens: If the duration of accessing the web site is less than the average duration for all web sites in the corresponding access point, the web site is moved one state closer to the most internal state of this access point. If the duration of accessing the web site is more than the average duration in the corresponding access point, the web site is moved one state closer to the boundary state of this access point; if the site is already located in the boundary state, it is instead moved to another, randomly chosen access point. The above scheme is repeated for as long as the crawling / measurement is ongoing. This ensures that the scheme works in a dynamic environment (such as the real web). Furthermore, we show, using experimental data, that the algorithm works towards an optimal distribution of web sites across the available access points.
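The update rule described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. The class and method names (`Site`, `Twoma`, `record_access`), the round-robin initial placement, the use of each site's most recent duration when computing the per-access-point average, and the choice to place a migrated site in the boundary state of its new access point are all assumptions made for illustration; the abstract does not specify them.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass
class Site:
    name: str
    access_point: int                      # index of the crawler node holding the site
    state: int                             # 1 = most internal state, depth = boundary state
    last_duration: Optional[float] = None  # most recent access duration, if any


class Twoma:
    """Illustrative sketch of the TWOMA update rule (details are assumptions)."""

    def __init__(self, site_names, num_access_points, depth, rng=None):
        self.num_access_points = num_access_points
        self.depth = depth
        self.rng = rng or random.Random()
        # Assumption: start from a uniform (round-robin) distribution,
        # with every site in the boundary state of its access point.
        self.sites = {
            name: Site(name, i % num_access_points, depth)
            for i, name in enumerate(site_names)
        }

    def average_duration(self, access_point):
        # Assumption: average over the latest known duration of each
        # site currently assigned to this access point.
        durations = [
            s.last_duration
            for s in self.sites.values()
            if s.access_point == access_point and s.last_duration is not None
        ]
        return sum(durations) / len(durations) if durations else None

    def record_access(self, name, duration):
        site = self.sites[name]
        site.last_duration = duration
        average = self.average_duration(site.access_point)
        if duration < average:
            # Faster than average: move one state towards the most internal state.
            site.state = max(1, site.state - 1)
        elif duration > average:
            if site.state < self.depth:
                # Slower than average: move one state towards the boundary.
                site.state += 1
            else:
                # Already in the boundary state: migrate to another random access point.
                others = [
                    ap for ap in range(self.num_access_points)
                    if ap != site.access_point
                ]
                if others:
                    site.access_point = self.rng.choice(others)
                    site.state = self.depth  # assumption: enter at the boundary state
```

In a crawler loop, `record_access` would be called after every download-and-measure cycle with the measured duration, so sites that are consistently slow at one access point drift to its boundary and are eventually tried elsewhere.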
Similar resources
Prioritize the ordering of URL queue in Focused crawler
The enormous growth of the World Wide Web in recent years has made it necessary to perform resource discovery efficiently. For a crawler it is not a simple task to download the domain-specific web pages. This unfocused approach often shows undesired results. Therefore, several new ideas have been proposed, among them a key technique is focused crawling which is able to crawl particular topical...
Positioning of Industries in Cyberspace Evaluation of Web Sites Using Correspondence Analysis
In today’s extremely competitive markets it is crucial for companies to strategically position their brands, products and services relative to their competitors. With the emerging trend in internationalization of companies, especially SMEs, and the growing use of the Internet in this regard, a great deal of attention has been turned to effective involvement of the Internet channel in the mar...
Crawling the Web: Discovery and Maintenance of Large-scale Web Data
This dissertation studies the challenges and issues faced in implementing an effective Web crawler. A crawler is a program that retrieves and stores pages from the Web, commonly for a Web search engine. A crawler often has to download hundreds of millions of pages in a short period of time and has to constantly monitor and refresh the downloaded pages. In addition, the crawler should avoid putt...
Early Results from Automatic Accessibility Benchmarking of Publ
Benchmarking of web accessibility is performed throughout Europe, to assess and raise awareness of web accessibility. The evaluation is often based on manual assessments with a high cost and with long intervals. The Web Content Accessibility Guidelines from W3C/WAI are the basis of most evaluations. Although the same guidelines are used, a range of different evaluation methodologies and scoring...
Reliability, Readability and Quality of Online Information about Femoracetabular Impingement
Background: The Internet has become the most widely used source for patients seeking more information about their health, and many sites geared towards this audience have gained widespread use in recent years. Additionally, many healthcare institutions publish their own patient-education web sites with information regarding common conditions. Little is known about how these resources impact pati...